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Abstract 

Although many genetic factors have been successfully identified for human diseases in genome-wide association 
studies, genes discovered to date only account for a small proportion of overall genetic contributions to many 
complex traits. Association studies have difficulty in detecting the remaining true genetic variants that are either 
common variants with weak allelic effects, or rare variants that have strong allelic effects but are weakly associated 
at the population level. In this work, we applied a goodness-of-fit test for detecting sets of common and rare 
variants associated with quantitative or binary traits by using whole genome sequencing data. This test has been 
proved optimal for detecting weak and sparse signals in the literature, which fits the requirements for targeting 
the genetic components of missing heritability. Furthermore, this p value-combining method allows one to 
incorporate different data and/or research results for meta-analysis. The method was used to simultaneously 
analyse the whole genome sequencing and genome-wide association studies data of Genetic Analysis Workshop 
18 for detecting true genetic variants. The results show that goodness-of-fit test is comparable or better than the 
influential sequence kernel association test in many cases. 



Background 

According to the Catalog of Genome- Wide Association 
Studies updated by the National Human Genome 
Research Institute, approximately 7260 single-nucleotide 
polymorphisms (SNPs) have been identified for 770 
traits in 1360 publications as of November 2012. How- 
ever, researchers believe that a significant proportion of 
heritability of many complex traits is still missing [1,2]. 
The remaining genetic variants to be detected are either 
common variants with small allelic effects, or rare var- 
iants with relatively strong allelic effects. In both cases, 
the genetic effects are weak at the population level. 
Furthermore, only a small proportion of the avalanche 
of candidate variants are likely associated with a trait, 
which is a problem closely related to sparse signal dis- 
covery in statistics. It is very challenging to detect weak 
and sparse genetic effects via association. 

To address this issue, we adopt a goodness-of-fit test 
(GOFT) [3] that has been proved to be optimal under a 
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Gaussian means model [4]. That is, the boundary of the 
reliable detection of this method reaches the lowest possi- 
bility among all statistical tests when the signals are weak 
and sparse. Because the Gaussian means model is asymp- 
totically equivalent to regression models [5], the GOFT is 
promising in detecting weak and sparse genetic effects 
through regression model fitting. In this work, we illus- 
trate how to apply the test to whole genome sequencing 
(WGS) data by using the Genetic Analysis Workshop 18 
(GAW18) data. The method is assessed under various rare 
variant collapsing strategies, and compared with the 
sequence kernel association test (SKAT) [6]. Moreover, 
because GOFT is a method combining p values, it has the 
potential to be used as a meta-analysis for incorporating 
data from different sources. We apply GOFT to simulta- 
neously analyse WGS data and genome-wide association 
studies (GWAS) data for detecting genetic loci associated 
with systolic blood pressure (SBP). The results show that 
even without the sophisticated weighting scheme, GOFT 
is comparable to, and sometimes better than, SKAT under 
its best weighting scheme. In addition, at small p value 



© 2014 Yang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons 
Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http:// 
creativecommonsorg/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. 



Yang et al. BMC Proceedings 2014, 8(Suppl 1):S51 
http://www.biomedcentral.eom/1753-6561/8/S1/S51 



Page 2 of 6 



cutoffs, the GOFT meta-analysis provides higher power 
than that when only WGS data is used. 

Methods 

Method 1: sequence kernel association test 

Sequence kernel association test (SKAT) [6] is a super- 
vised and flexible test for the associations between sets of 
genetic variants and a continuous or dichotomous trait. 
Through adjusting the variance of the random effect 
coefficients of the genetic variants, SKAT can consider 
different weights for different variants in contributing to 
the response trait. Typically, the rare variants are 
assigned with larger weights than the common variants 
based on the rare variant-common disease model [7]. We 
use the R package SKAT [8] for the WGS data analysis. 

Method 2: goodness-of-fit test 

The problem of determining the associations between a 
set of genetic variants and the trait can be viewed as a 
multiple hypotheses testing problem. Under the null 
hypothesis that there is no genetic association, the p 
value from each genetic variant follows a uniform (0,1) 
distribution. So testing a group of variants can be consid- 
ered as a GOFT that measures the consistency between 
the empirical distribution of the observed p values and 
the uniform distribution. Here we adopt a GOFT from 
Berk and Jones [3], which was proposed from large devia- 
tion theory, and then was proved optimal in detecting 
weak and sparse signals [4]. Let P(i) < ■ ■ ■ < p(L) be the 
sorted p values from L individual variants and the trait. 

j 

The GOFT statistic is G = 1 ' max L K (j'P(j) ), where, 
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Comparing with many p value-based SNP-set testing 
methods that sum up all p values together in certain for- 
mulas [9], GOFT looks for the most representative p value 
to the SNP set. At the same time, unlike the minimal 
p value method that fixes the smallest p value to represent 
the set, GOFT adapts to the signal pattern through the 
maximization procedure. Such adaptation is critical 
because the p value of a true association is not necessarily 
the minimal value, especially when true associations are 
sparse and weak [4]. Another advantage is that the GOFT 
statistic only requires information from a set of p values to 
work, so it can be flexibly applied to different genetic stu- 
dies based on the corresponding appropriate p values, or 
to meta-analyses that incorporate various data sources. 

A permutation test can be applied to accommodate 
the various sizes of variant sets and the linkage 



disequilibrium structures among the variants. Specifi- 
cally, let G s and G sm , s = 1, . . . , S, m = 1, . . . , M, denote 
the GOFT statistics of the s th genome segment window 
from the original data and from the m th permutation of 
the genotype data, respectively. The empirical p value 
for the s th window is p s = #{G sm > G s ,m = I, ... ,M} /M. 
The number of permutations M = 1000 was used in the 
following data analysis. 

Collapsing of rare variants 

For the association study of complex diseases based on 
WGS data, a major challenge is to address rare variants 
that have weak statistical association as a result of small 
allele frequency. The GOFT is asymptotically optimal 
for weak and sparse signals, and is a right fit in this sce- 
nario. At the same time, because the effects of missense 
rare alleles are mostly in the same deleterious direction 
[10], collapsing the rare variant before GOFT is likely 
more efficient [11]. Furthermore, because common and 
rare variants contribute to complex diseases, it is good 
to combine information from both to facilitate the 
detection of associated genome segments. Following a 
literature work [12], we collapse rare single-nucleotide 
variants (SNVs) that allocate between adjacent common 
SNVs by summation of their genotype. Then the p 
values for associations of both collapsed rare variants 
and common variants are obtained and fed into GOFT 
test statistic in equation (1) to study the overall signifi- 
cance of variant groups. 

Results 

For evaluating the above association tests, we used the 
WGS "dose" file of 1,215,399 SNVs and the GWAS file 
of 65,519 SNVs on chromosome 3 as the genotype data. 
The quantitative trait was the SBP for the 142 indepen- 
dent individuals who had no missing genotype. To 
assess how the SNV group's size may have affected the 
performance of these tests, we split chr3 into segments 
of fixed windows with 1 of 3 widths: 10 kilobase pairs 
(kbp), 100 kbp, and 500 kbp. Respectively at those 3 
levels, the grouping strategy resulted in 19,472, 1950, 
and 391 windows, among which 87, 37, and 20 windows 
contained true SNVs that were either nonsynonymous 
or regulatory to SBP according to the GAW18 simula- 
tion [13]. The true windows and the 200 simulation 
replicates were used to evaluate power and type I error 
rate. We defined a SNV as a rare variant if it had a 
minor allele frequency less than 5%. The knowledge of 
the simulated true SNVs was only used for evaluating 
the power of the association tests, not for designing the 
tests and the data analysis strategies. 

For GOFT, we assessed its type I error rates as esti- 
mated based on the false-positive rate of the 19,385 
false 10 kbp windows on chr3 over a sequence of 
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cutoffs. The type I error rate was well controlled (results 
are available upon request). Figure 1 shows the assess- 
ment for the power of GOFT in detecting overall 
genetic associations, which was estimated by the true- 
positive rate of true-association windows based on 
GAW18 simulation replicate 1. We considered various 
window sizes with and without rare variant collapsing. 
Larger windows provided higher power at large cutoffs, 
but not at the small p values that are often used in prac- 
tice. This is because large windows likely had more 
noise variants, which diluted the signals from true var- 
iants, and thus were harder to get very small p values. 
Meanwhile, rare variant collapsing did help to increase 
the power in general. 

Under the window size of 10 kbp, we assessed SKAT 
with different strategies of weighting variants: flat 
weight, Beta(l, 1), Beta(0.5, 0.5), Beta(l, 25), and logistic 
(0.07, 150). Figure 2 shows the power of detecting the 
87 true 10 kbp windows on chr3 over a variety of p 
value cutoffs. The Beta(l, 25) and logistic weights per- 
formed better for small p value cutoffs. Figure 2 also 
shows that GOFT was similar to the best SKAT setups 
for small p values. In fact, GOFTs had a larger area 
under the curve than SKATs when comparing their 



whole receiver operating characteristic (ROC) curves 
(results are available upon request). 

To study the performance in detection of various pat- 
terns of genetic effects, we compared the power of 
GOFT and logistic-weight SKAT in detecting each of all 
87 true 10-kbp windows on chr3. The power was esti- 
mated by the true-positive rate of a true window among 
200 replicates. There are 3 patterns of comparisons: 
GOFT was better in 42 windows (Figure 3, left panel), 
SKAT was better in 27 windows (Figure 3, middle 
panel), and both were similar. Figure 3 illustrates exam- 
ples of these comparisons based on the ROC curve 
(complete results are available upon request). GOFT 
seems better overall, but the comparison is not very sig- 
nificant (42 vs. 27, with a p value = 0.10). The type I 
error rate was well controlled here (results available 
upon request). 

Because GOFT only requires p values as the input, it 
has the potential to be used in a meta-analysis that 
incorporates data from different studies. Here we evalu- 
ated how much useful information the GWAS data 
could contribute to the WGS study. By mapping the 
"rs" IDs to the chromosome report from dbSNP, we cal- 
culated the p values of 65,519 GWAS SNVs on chr3. 




10,000 with collapsing 
100,000 with collapsing 
500,000 with collapsing 
1 0,000 without collapsing 
1 00,000 without collapsing 
500,000 without collapsing 



Figure 1 Power of GOFT for different window sizes with or without collapsing variants. Power is estimated by the true-positive rate of 
true-association windows on chr3 based on GAW18 simulation replicate 1. 
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Figure 2 Power of GOFT and SKAT under different weighting schemes. Power is estimated by the true-positive rate of 87 true 10-kbp 
windows on chr3. 



On average, 3.4 GWAS SNVs were added into each 
window (approximately a 5% increase). Figure 4 shows 
that the type I error rate control, after adding the 
GWAS SNVs, was still good {left panel) and that adding 
GWAS data helped to improve the power of GOFT in 
detecting true 10-kbp windows on chr3 in general. 



Discussion 

We attempt to address the low power issue of associa- 
tion tests for WGS data from 2 aspects. First, we prefer 
to use tests specially designed for detecting weak and 
sparse genetic effects. For this purpose, the GOFT is 
asymptotically optimal in the sense that its asymptotic 
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detection boundary is one of the lowest boundaries 
among all statistical methods. If signals are weaker or 
sparser than this boundary, no statistical methods 
would work well. Second, we try to borrow information 
from other data sets through meta-analysis. Although 
there has been some debate on how much of total her- 
itability could be explained by GWAS data [7], the 
common agreement is that both common and rare var- 
iants contribute to complex diseases. It is potentially 
helpful to add GWAS data into WGS in order to 
increase the power. Our results show that both 
attempts are promising. 

At the same time, several future works could be con- 
sidered based on the limitations of the current study. 
First, the sample size is likely still small for either veri- 
fying asymptotical results or larger power of detecting 
weak genetic effects simulated in the data. It would be 
nice to further confirm the patterns of comparisons 
among these association methods by simulated and 
real data with much larger sample size. Second, gene- 
based collapsing can be applied as an alternative to the 
window scheme used here. Third, we have applied a 
simple rare variant collapsing process by direct sum- 
ming the genotypes. This collapsing strategy is less 
sophisticated than the weighting strategy of SKAT. In 
fact, GOFT can further incorporate more successful 
collapsing strategies to improve its power, for example, 
by weighting the SNPs, like what SKAT and other 
methods have adopted [14-17]. Lastly, GOFT repre- 
sents a first-stage analysis, which only seeks to answer 
where the associations are located; additional analyses 
could be required to determine the number and exact 
location of causal signals. 



Conclusions 

We adopted a GOFT to WGS data analysis for detecting 
disease-associated genomic segments, and compared it 
to the SKAT by using the GAW18 simulation data with 
SBP l as response. Even without a sophisticated weight- 
ing scheme, in many cases, GOFT is comparable to or 
better than SKAT with the best weighting scheme. 
GOFT can be applied to a combination of GWAS and 
WGS data. Our results show that such meta-analysis 
has the potential to provide higher power over WGS 
data analysis only. In all cases, the power is still low for 
detecting overall heritability under the sample size of 
142 independent individuals for genetic association 
study. 
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